Method for coding and decoding multimedia data



FIELD OF THE INVENTION 

The invention relates to a method of coding a plurality of multimedia data
comprising the following steps :

- an acquisition step, for converting said original multimedia data into one or
several bitstreams ;

- a structuring step, for capturing the different levels of information in said
bitstream(s) by means of analysis and segmentation ;

- a description step, for generating description data of the obtained levels of
information ;

- a coding step, for encoding the description data thus obtained.

The invention also relates to corresponding computer-executable process steps, and to a
method for decoding data that have been coded by means of said coding method.

BACKGROUND OF THE INVENTION

More and more digital broadcast services are now available, and it therefore appears useful to enable good exploitation of multimedia information resources by users, who generally are not information technology experts. Said multimedia information generally consists of natural and synthetic audio, visual, and object data, intended to be manipulated for operations such as streaming, compression and user interactivity, and the MPEG-4 standard is one of the most widely agreed solutions providing functionalities for carrying out said operations. The most important aspect of MPEG-4 is the support of interactivity through the concept of object, which designates any element of an audio-visual scene : the objects of said scene are encoded independently and stored or transmitted simultaneously in a compressed form as several bitstreams, the so-called elementary streams. The architecture of a typical MPEG-4 terminal, shown in Fig. 1, comprises the following elements (starting at the bottom of the figure, although the functionality "interactivity" means that said components may also be actuated in the reverse direction, from the terminal to the server or any other type of transmitter) :
(a) a delivery or transport layer 11, also called "TransMux layer", which is
media independent - MPEG-4 data can be transported on transport layers such as RTP
(Internet), MPEG-2 transport streams, H.323, or ATM, for instance - and receives
multiplexed streams of compressed data from a transmission (or storage) medium ;

(b) a synchronization or elementary stream layer 12, also called "FlexMux layer",
which receives FlexMux streams from the layer 11 and which is in charge of the
synchronization and buffering of the compressed data : this layer receives the packetized
streams delivered by the transport layer 11 and outputs elementary streams respectively
corresponding to different multimedia objects and composed of access units ;

(c) a media layer (or compression layer) 13, receiving the elementary streams 
from the layer 12 and performing the decoding of the data that are extracted from said layer 
12; 

(d) a composition and rendering stage 14, intended to build the final scene 
arrangement, and a display 15 of the obtained audiovisual scene. 

The specification of MPEG-4 includes an object description framework intended to identify and describe the elementary streams (audio, video, etc.) and to associate them in an appropriate manner in order to obtain the scene description and to construct and present to the end user a meaningful multimedia scene : MPEG-4 models multimedia data as a composition of objects. However, the great success of this standard contributes to the fact that more and more information is now made available in digital form. Finding and selecting the right information therefore becomes harder, for human users as well as for automated systems operating on audio-visual data for any specific purpose, both of which need information about the content of said information, for instance in order to take decisions in relation with said content.

The objective of the MPEG-7 standard, not yet frozen, will be to describe said content, i.e. to find a standardized way of describing multimedia material as different as speech, audio, video, still pictures, 3D models, or others, and also a way of describing how these elements are combined in a multimedia document. MPEG-7 is therefore intended to define a number of normative elements called descriptors D (each descriptor is able to characterize a specific feature of the content, e.g. the color of an image, the motion of an object, the title of a movie, ...), description schemes DS (the description schemes define the structure and the relationships of the descriptors), description definition language DDL (intended to specify the descriptors and description schemes), and coding schemes for these descriptions (Fig.2 gives a graphical overview of these MPEG-7 normative elements and their relation). Whether it is necessary to standardize descriptors and description schemes is still under discussion in MPEG. It seems however likely that at least a set of the most widely used ones will be standardized.



SUMMARY OF THE INVENTION

It is therefore an object of the invention to propose a new descriptor (and a new, corresponding description scheme) intended to be very useful in relation with the MPEG-7 standard.

To this end, the invention relates to a coding method as described in the introductory part of the description and in which said description step comprises :

- a defining sub-step provided for storing a set of descriptors related to said plurality of multimedia data ; and

- a description sub-step, provided for selecting the description data to be coded in accordance with every level of information as obtained in the structuring step ;

and said set of descriptors includes at least a shape descriptor and a shape deformation descriptor.

The invention also relates, for use in a coding device provided for encoding a plurality of multimedia data, to computer-executable process steps provided to be stored on a computer-readable storage medium and comprising the following steps :

- an acquisition step, for converting said original multimedia data into one or several bitstreams ;

- a structuring step, for capturing the different levels of information in said bitstream(s) by means of analysis and segmentation ;

- a description step, for generating description data of the obtained levels of information ;

- a coding step, for encoding the description data thus obtained ;

wherein said description step comprises :

- a defining sub-step provided for storing a set of descriptors related to said plurality of multimedia data ; and

- a description sub-step, provided for selecting the description data to be coded in accordance with every level of information as obtained in the structuring step ;

and said set of descriptors includes at least a shape descriptor and a shape deformation descriptor.
BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will now be described, by way of example, with 
reference to the accompanying drawings in which : 

Fig. 1 illustrates the architecture of an MPEG-4 terminal for processing
and reconstructing an audiovisual interactive scene ;

Fig.2 gives a graphical overview of MPEG-7 normative elements and their
relation, and therefore defines the MPEG-7 environment in which users may then deploy
other descriptors (either in the standard or, possibly, not in it) ;

Fig.3 and 4 illustrate the coding and decoding methods according to the invention.

DETAILED DESCRIPTION OF THE INVENTION 



The method of coding a plurality of multimedia data according to the invention, illustrated in Fig.3, comprises the following steps : an acquisition step (CONV), for converting the available multimedia data into one or several bitstreams, a structuring step (SEGM), for capturing the different levels of information in said bitstream(s) by means of analysis and segmentation, a description step, for generating description data of the obtained levels of information, and a coding step (COD), for encoding the description data thus obtained. More precisely, the description step comprises a defining sub-step (DEF), provided for storing a set of descriptors related to said plurality of multimedia data, and a description sub-step (DESC), for selecting the description data to be coded, in accordance with every level of information as obtained in the structuring step on the basis of the original multimedia data. The coded data are then transmitted and/or stored. The corresponding decoding method, illustrated in Fig.4, comprises the steps of decoding (DECOD) the signal coded by means of the coding method hereinabove described, storing (STOR) the decoded signal thus obtained, searching (SEARCH) among the data constituted by said decoded signal, on the basis of a search command sent by a user (USER), and sending back to said user the retrieval result of said search in the stored data.

Among the descriptors stored in relation with all the possible multimedia content, the two proposed according to the invention are based on complex Fourier descriptors, in order to characterize a shape and its deformation in time, i.e. to characterize a segmented moving object as more or less rigid. Indeed, much semantic information may be extracted from the shape of an object and its deformation in time. For example, in a video-surveillance application, the rigidity of a moving region makes it possible to differentiate pedestrians from vehicles. However, when vehicles are driving away from the camera, the 2D shape changes due to the perspective effect.

In order to cope with this possible variation of scale or translation, the proposed descriptor must be invariant to basic geometrical transformations and scalable, so that it can describe the shape and its deformation with more or less precision. It is therefore based on complex Fourier descriptors, which are invariant by translation, rotation and scaling. Moreover, a compact shape deformation descriptor is then extracted by measuring the variability of the different frequencies in time.

The definition of complex Fourier descriptors is the following. These descriptors consist in a lossless representation of a shape contour Γ. A contour is defined as a set of points surrounding a surface. Depending on the sampling, points are not necessarily connected. The length of the contour is the number of points used to describe it and therefore also depends on the sampling. Complex Fourier descriptors, which are an equivalent frequency-domain description and not a parametric representation, are defined by :

$$Z_k = \sum_{n=1}^{L} z_n \exp\left(-\frac{2 i \pi n k}{N}\right), \qquad 0 \le k < N \qquad (1)$$

where $z_n = x_n + i y_n$ stands for the coordinates of the n-th point of Γ, written as a complex number (the real part is the abscissa, the imaginary part is the ordinate), L stands for the length of Γ and N is the number of frequency bins.

These descriptors have the same meaning as in signal processing : low frequencies, for k around 0 and N-1, give a coarse idea of the shape, while high frequencies, for k around N/2, represent fine details. This means that if two contours are very similar except for small details or for a small local part, the first coefficients will be very close, whereas the last ones will be completely different. Besides, if the shape is not rigid, the shape contour will change and so will the first coefficients. Of course the last ones will change as well, but not significantly. Hence, the first coefficients serve to cluster shape contours. Intrinsically, complex Fourier descriptors are a scalable representation of the contour :

- $Z_0$ stands for the continuous component (DC or Direct Current) and represents the non-normalized centroid of the contour ;

- $Z_1$ is the radius of the circle whose surface is equivalent to that of the shape, which can be interpreted as a scale parameter ;

- $Z_k$ and $Z_{N-k}$, with $1 < k < N$, have similar but opposite properties : k being the number of actions regularly spaced on the unit circle, k, with $1 < k < N/2$, represents the number of tension actions on the unit circle towards the outside, whereas k, with $N/2 + 1 < k < N$, represents the number of pressure actions towards the inside ;

- The phase $\theta_k$ of $Z_k$ locates the action on the circle.

The properties of complex Fourier descriptors will now be recalled. Assume that a contour $\Gamma_1$ is translated by T, rotated by φ and scaled by a factor λ to obtain $\Gamma_2$, $\Gamma_1$ and $\Gamma_2$ having the same number of points. Then, there exists a simple relation between the complex Fourier descriptors $Z_k^1$, $0 \le k < N$, of $\Gamma_1$ and $Z_k^2$, $0 \le k < N$, of $\Gamma_2$ :

$$Z_k^2 = T + \lambda \exp(i\varphi)\, Z_k^1, \qquad 0 \le k < N \qquad (2)$$

Making $Z_k$ invariant by translation, rotation and scaling is equivalent to cancelling the effects of T, φ and λ. Said properties are :

(a) translation invariance : T being a continuous component and being therefore contained in $Z_0$, by not considering $Z_0$, the set of coefficients { $Z_k$, $1 \le k < N$ } is translation invariant.

(b) rotation and starting point invariance : by considering the set of coefficients { abs($Z_k$), $1 \le k < N$ }, where abs() is the modulus of $Z_k$, the set { abs($Z_k$), $1 \le k < N$ } is phase invariant. As both the starting point and a rotation induce a move of the phase (in fact, a multiplication by exp(iθ)), the descriptor is rotation and starting point invariant.

(c) scale factor invariance : when focusing on :

$$\mathrm{abs}(Z_k^2) = \lambda\, \mathrm{abs}(Z_k^1), \qquad 1 \le k < N$$

and finally considering the set of coefficients { abs($Z_k / \lambda$), $1 \le k < N$ }, the resulting descriptor is also scale invariant. Unfortunately, λ is not known, but is present in each $Z_k$. It is therefore chosen to normalize by one of the descriptors. Since $Z_1$ is known to be a scale factor, each abs($Z_k$), $1 < k < N$, will be divided by abs($Z_1$). The resulting set { abs($Z_k$) / abs($Z_1$), $1 < k < N$ } is translation, rotation and scale invariant.



(d) contour length invariance : the equation (1) being established for two contours with the same number of points, if their numbers of points differ, then their frequency-domain descriptions will also differ. Their difference of length can be interpreted as a difference of sampling. To cancel the influence of length, one contour must be resampled to the length of the other. By choosing for L a power of two, one can take N = L, which makes the description also sampling invariant. As a matter of fact, if the contour Γ is downsampled from $L_1 = 2^{m_1}$ points to $L_2 = 2^{m_2}$, with $m_2 < m_1$, then the first and last frequency bins of each descriptor will correspond exactly to the same frequency, because the frequency lap remains the same (and conversely for upsampling from $L_1 = 2^{m_1}$ points to $L_2 = 2^{m_2}$, with $m_2 > m_1$).

(e) compaction property : { $Z_k$, $0 \le k \le N_0 \ \cup\ N - N_0 \le k < N$ } will be a truncated list of the complete list of the N complex Fourier descriptors necessary to describe the shape losslessly, and the resulting reconstructed shape will be a filtered version of the initial shape (the number $N_0$, with $1 < N_0 < N$, of descriptors to retain depends on the complexity of the contour ; however, 50 % of all coefficients are necessary to obtain a well-reconstructed contour with very few artifacts).

(f) robustness to incomplete view : as experiments have shown that the Fourier descriptors are sometimes very similar, sometimes completely different (depending on the contour and the percentage of occlusion), it must be recognized that they are not robust to incomplete view.

(g) scalability : complex Fourier descriptors are intrinsically scalable : the higher the frequency, the finer the description.

These definitions and properties being recalled, the shape descriptor according to the invention is now presented :

(a) descriptor definition :

The input data are a binary mask of an object sampled on a regular grid. The object has no holes, and is not a fractal object. Beforehand, the contour of the object must be extracted, then resampled so that its number of points is a power of two, $L = 2^m$.

(b) specifications of the proposed shape descriptor :

The descriptor should not only contain the necessary information on the shape but also be a summary of the full information available at start. The following descriptor, especially usable in the MPEG-7 standard, is therefore proposed :

- Centroid ($C_x$, $C_y$) : coordinates of the centroid of the contour.
- Angle θ : angle between the horizontal and the main axis of the contour.
- Size of the original contour N : size of the contour after resampling.
- Set of ordered Fourier coefficients $Z_k$ : set of invariant Fourier coefficients.
- Size of the Fourier coefficients set P : size of the preceding set, with $1 \le P \le N$, P being necessarily odd.
- Scale : scale parameter.

The corresponding C structure may be the following one :

    typedef struct ShapeDescriptor {
        /* Centroid */
        long center_x;
        long center_y;
        /* Angle */
        float theta;
        /* Size of the original contour, after resampling (N) */
        long size_of_contour;
        /* Set of Fourier coefficients */
        float *Fourier_Coefficients;
        /* Size of the set of Fourier coefficients (P) */
        long size_Fourier_Descriptors_Set;
    } ShapeDescriptor;

(c) extraction of this shape descriptor :

These are the steps which lead to a set of invariant Fourier coefficients :

- Compute the two eigenvectors and the two eigenvalues from the contour and store angle as the angle θ between the eigenvector associated with the biggest eigenvalue and the horizontal (θ is known modulo π).

- Compute the FFT on the resampled contour of size N, in order to obtain { $Z_k$, $0 \le k < N$ }, and store centroid as :

$$C_x = \frac{\mathrm{Re}(Z_0)}{N}, \qquad C_y = \frac{\mathrm{Im}(Z_0)}{N}$$

- Take the modulus of each { $Z_k$, $1 \le k < N$ }.

- Store scale as :

$$\mathrm{scale} = \frac{\mathrm{abs}(Z_1)}{N}$$

- Divide each { $Z_k$, $2 \le k < N$ } by abs($Z_1$) and store as Fourier coefficients $Z'_j$, with $1 \le j < N$.

- Depending on the application, choose the final number P out of N Fourier coefficients to keep.

- Store the Fourier coefficients in the following order : $Z'_1 \ldots Z'_P = \mathrm{abs}(Z_2 / Z_1), \ldots$ [the remaining terms of the sequence are illegible in the source].



(d) matching :

Given two sets of Fourier descriptors $\Theta_1$ and $\Theta_2$, in order to compare their similarity neither the position nor the angle, which do not characterize the shape itself and can be treated separately, will be taken into account. If the two sets are of different sizes, $P_1$ and $P_2$ respectively, with for instance $P_1 < P_2$, then the first $P_1$ Fourier coefficients of the two sets must be compared. Considering that for one set, the values $f_{\Theta_i}(k)$ at each frequency bin of order k are of different orders of magnitude, it is relevant, for each frequency bin, to normalize the difference of values between the two sets by the magnitude at the current frequency bin. To harmonize the difference of magnitude between frequencies, it has been chosen to sum relative errors between corresponding frequency bin values $f_{\Theta_i}(k)$ of the two descriptors. Finally, it should be considered that the coarse structure (low frequencies) prevails over fine details (high frequencies), and a weighting function ω(k), which privileges the low frequency range at the expense of the high frequency range, is therefore introduced and sets the influence of details in the final result. Δ will denote the dissimilarity function and Sim the corresponding similarity function. Return values are between 0 and 1.

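$$\Delta(\Theta_1, \Theta_2) = \frac{1}{\Omega} \sum_{k=1}^{P} \omega(k)\, E\big(Z'_{\Theta_1}(k),\, Z'_{\Theta_2}(k)\big)$$

$$P = \min(P_1, P_2)$$

$$E(x, y) = \begin{cases} \dfrac{x - y}{x} & \text{if } x > y \\[2mm] \dfrac{y - x}{y} & \text{if } y > x \end{cases}$$

$$\Omega = \sum_{k=1}^{P} \omega(k)$$

$$\mathrm{Sim} = 1 - \Delta(\Theta_1, \Theta_2)$$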
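The matching function can be illustrated by the C sketch below. It assumes non-negative coefficient values (moduli), takes E(x, x) = 0, and assumes a weighting function ω(k) = 1/(1 + k²) ; the function names are hypothetical and the weighting used in practice may differ.

```c
#include <stddef.h>

/* Relative error E(x, y) between two non-negative values, in [0, 1]. */
double relative_error(double x, double y)
{
    if (x > y)
        return (x - y) / x;
    if (y > x)
        return (y - x) / y;
    return 0.0; /* equal values, including x = y = 0 */
}

/* Assumed weighting: favours low frequency bins (small k). */
double low_pass_weight(size_t k)
{
    return 1.0 / (1.0 + (double)k * (double)k);
}

/* Weighted, normalized sum of relative errors over the first
 * P = min(Pa, Pb) coefficients. Returns 0 when P is 0. */
double shape_dissimilarity(const double *a, size_t Pa,
                           const double *b, size_t Pb)
{
    size_t P = Pa < Pb ? Pa : Pb;
    double sum = 0.0, omega = 0.0;

    for (size_t k = 1; k <= P; ++k) {
        double w = low_pass_weight(k);
        sum += w * relative_error(a[k - 1], b[k - 1]);
        omega += w;
    }
    return P == 0 ? 0.0 : sum / omega;
}

double shape_similarity(const double *a, size_t Pa,
                        const double *b, size_t Pb)
{
    return 1.0 - shape_dissimilarity(a, Pa, b, Pb);
}
```

Because each relative error lies between 0 and 1 and the weights are normalized by their sum, the returned dissimilarity and similarity both stay between 0 and 1.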

Similarly, the shape deformation descriptor according to the invention is now presented :

(a) descriptor definition :

The input data is a segmented video sequence of a single object, i.e. a sequence of binary masks. The shape descriptor of the contour at each frame will be computed and stored for processing, as described above in "(c) extraction of this shape descriptor". This descriptor is based upon the shape descriptor exposed above, with :

- Normalized deviation of the scale : normalized deviation of the scale parameter over the video sequence.
- Maximal size of the original contours : the maximum of the original contour sizes N over the video sequence. N is an item of the shape descriptor.
- Normalized deviations of each Fourier coefficient $\sigma_{Z_k}$ : normalized deviations of each Fourier coefficient over the video sequence.
- Size of the set of normalized deviations of each Fourier coefficient M : size of the preceding set.
The corresponding C structure may be the following one :

    typedef struct ShapeDeformationDescriptor {
        /* Normalized deviation of scale */
        float Deviation_of_Scale;
        /* Maximal size of the original contours in the video sequence (N_max) */
        long Maximal_Size_of_Original_contours;
        /* Normalized deviation on Fourier coefficients */
        float *Deviation_of_Fourier_coefficients;
        /* Size of the set of normalized deviations of Fourier coefficients */
        long Size_of_Fourier_Coefficients_Set;
    } ShapeDeformationDescriptor;
(b) extraction of this shape deformation descriptor :

The deviation of the scale factor and of each Fourier coefficient over the video sequence is calculated by using the standard deviation. Dividing by the mean provides a normalization of the deviation. The size of the set of Fourier coefficients may vary along the video sequence, but as the frequency lap remains the same, the k-th Fourier coefficient $Z_k^i$ of the i-th frame will be averaged with the k-th Fourier coefficient of the j-th frame. The steps are the following :

- Calculate the mean of scale over the video sequence,
- Calculate the mean of each Fourier coefficient $Z_k$ over the video sequence,
- Calculate the standard deviation of scale over the video sequence,
- Calculate the standard deviation of each Fourier coefficient $Z_k$ over the video sequence,
- Divide the standard deviation of scale by its mean, and store as $\sigma_{scale}$,
- Divide the standard deviation of each $Z_k$ by its mean, and store as $\sigma_{Z_k}$.

(c) matching :

Although a matching function is not relevant for this shape deformation descriptor, because shape deformation descriptors are not intended to be compared, such a function may however be provided. The following function quantifies the similarity between two shape deformation descriptors $\Theta_1$ and $\Theta_2$. The number of normalized deviations of Fourier coefficients involved in the calculation depends on the sizes $M_1$ and $M_2$ of the two sets of normalized deviations that have to be compared. A weighting function ω(k) privileges the low frequency range at the expense of the high frequency range, in order to set the influence of details in the final result. Δ will denote the dissimilarity function and Sim the corresponding similarity function.



$\Delta(\Theta_1, \Theta_2)$ is defined as a sum over k = 1, ..., M [the summand is illegible in the source], with :

$$M = \min(M_1, M_2)$$

$$\omega(k) = \frac{1}{1 + k^2}$$

$$\mathrm{Sim} = 1 - \Delta(\Theta_1, \Theta_2)$$

The shape descriptor that has been proposed is appropriate for still images and the shape deformation descriptor for objects in video sequences. The shape descriptor is based on complex Fourier descriptors whose theory has been explained. It gives a frequency-domain description of the contour of the objects. First results show that the shape descriptor is both robust and discriminating. It is invariant by translation, rotation and scaling and also scalable. It handles resampling. Tests have even shown that downsampling increases matching scores. The dedicated matching function makes it possible to set the degree of similarity between two objects, as explained in the paragraph "(d) matching".

This shape descriptor is used as a basis to characterize shape deformation in a video sequence and thus to define a percentage of variation of each Fourier coefficient. That is possible because of the meaningful interpretation of its frequency-domain description. First results presented indicate that it is possible to evaluate how much a shape can be deformed by looking at the normalized deviation, which appears to quantify the degree of deformation. Its value is the deformation rate. Even if it is not designed for that purpose, this descriptor can be considered as a signature of the shape deformation and may be used in a query search in order to match objects that get out of shape in the same way.






